Abstract. The present work analyzes the redundancy of sets of combinatorial objects produced by a weighted random generation algorithm proposed by Denise et al. This scheme associates weights to the terminal symbols of a weighted context-free grammar, extends this weight definition multiplicatively to words, and draws words of length n with probability proportional to their weight. We investigate the level of redundancy within a sample of k words, the proportion of the total probability covered by k words (coverage), the time (number of generations) of the first collision, and the time of the full collection. For these four questions, we use an analytic urn analogy to derive asymptotic estimates and/or polynomially computable exact forms. We illustrate these tools by an analysis of an RNA secondary structure statistical sampling algorithm introduced by Ding et al.

1. Introduction



The random generation of combinatorial objects is motivated by the exploration of complex objects, by the empirical assessment of statistical properties, and by its applications to numerous fields (analysis of data structures and algorithms [?], software testing [6][5], bioinformatics [9], ...). Many approaches have been developed to address the uniform random generation of combinatorial objects of a given size. Historically, the recursive method, formalized by Wilf [24], starts by efficiently precomputing the numbers of objects accessible from local choices, and uses these numbers during the generation to perform a uniform random generation as an unbiased walk. This approach was later extended and made fully automatic by Flajolet et al [15] for all decomposable combinatorial classes, i.e. classes that are specified constructively within the symbolic framework as opposed to implicitly defined by a required property. Finally, Duchon et al recently relaxed this scheme through Boltzmann sampling.

Yet certain contexts require a non-uniform, yet controlled, distribution to be captured, giving rise to various approaches [3] for non-uniform generation. Denise et al [7] introduced weighted context-free grammars where a weight function, defined on the terminals and extended multiplicatively to words, induces a Boltzmann distribution over each subset of words of a given length n. The resulting languages are then used as models for objects following non-uniform distributions, natural instances of which can be found in bioinformatics [21]. An adaptation of the recursive method was proposed [7] to draw words of a given size n with respect to a weighted distribution. Multidimensional Boltzmann versions of the weighted samplers were also proposed for weighted languages by Bodini et al [3].

However, weighted distributions assign to the words of a given size probabilities that may differ by exponential factors, and may therefore induce a possibly large redundancy within a sampled set of words. Since the probability of a word is exactly and efficiently computable, such a redundancy is not informative and should be avoided. Furthermore, if a non-redundant sample of given cardinality k is expected, one may find situations where the complexity of generating k distinct words using a rejection approach becomes heavily dominated by the rejection step. Finally, the proportion of the distribution contained within a sampled set may be affected, positively or negatively, by the addition of weights. One of the authors proposed a non-redundant version of the recursive method [20] to work around the first issue. However, the question of the dependency between the weights and the level of redundancy was left open in a general setting.

The aim of the current work is to analyze the redundancy and coverage of a weighted sampled set of words. To tackle these questions, one can reformulate the repeated generation of words within a weighted language as a random allocation of balls into urns. Namely each word $w$ in $\mathcal{L}_n$, the restriction of the language to words of length $n$, will correspond to an urn having probability proportional to the weight of $w$. A list of questions naturally arises, which can be rephrased as classic random allocation problems:

(1) How many words are required before some word is drawn twice? This is a weighted instance of the Birthday paradox (the first 2-birthday [13]).

(2) How many words must be sampled before each word in $\mathcal{L}_n$ is encountered at least once? One finds in the above formulation the Coupon collector problem.

(3) How many distinct words are there after sampling $k$ words? This is equivalent to the expected number of urns having positive load after throwing $k$ balls.

(4) What is the coverage, i.e. the cumulated weight/probability of a non-redundant sampled set after $k$ generations? This last problem rephrases as the cumulated weight/probability of urns having positive load after throwing $k$ balls.

In this paper, we address and provide closed formulae and/or asymptotic estimates for these four statistical quantities under natural conditions of non-degeneracy, and illustrate our results with an analysis of a statistical sampling algorithm used to predict the folding of RNA. After this short introduction we recall in Section 2 some basic notions related to context-free grammars, languages, algebraic functions and their weighted analogs. In Section 3 we state our main results on weighted context-free languages in the form of four theorems dedicated to the four questions above. General results on weighted urn models are established or recalled in Section 4, of which our theorems are direct corollaries. We apply in Section 5 our theorems to an analysis of a statistical sampling algorithm used to predict the functional folding of RNAs, using the fact that the three-dimensional structure of an RNA can be modeled by a secondary structure, i.e. a word of a Motzkin-like context-free language. We conclude with some possible extensions of the current work.

2. Definitions and notations 

2.1. Weighted context-free languages. Throughout the rest of the document, $n$ will stand for the length of generated words. For the sake of self-containment, let us start by recalling some definitions found in Denise et al [7].

A weighted context-free grammar is a 5-tuple $\mathcal{G}_\pi = (\pi, \Sigma, \mathcal{N}, \mathcal{P}, \mathcal{S})$ such that

• $\Sigma$ is the alphabet, i.e. a finite set of terminal symbols.

• $\mathcal{N}$ is a finite set of non-terminal symbols.

• $\mathcal{P}$ is the finite set of production rules, each of the form $N \to X$, for $N \in \mathcal{N}$ any non-terminal and $X \in \{\Sigma \cup \mathcal{N}\}^*$.

• $\mathcal{S}$ is the axiom of the grammar, i.e. the initial non-terminal.

• $\pi$ is a positive weight vector $\pi = (\pi_t)_{t \in \Sigma}$, assigning a positive weight to each letter $t \in \Sigma$.

Let us further assume that the input grammar is unambiguous. This is a real limitation; however, a similar analysis for intrinsically ambiguous languages is rather challenging since the associated generating functions are not necessarily algebraic but possibly transcendental [12].

Let us denote by $\mathcal{L}$ the language generated from the axiom of $\mathcal{G}_\pi$, and by $\mathcal{L}_n$ its restriction to words of size $n$. One can extend the weight multiplicatively to any word $w \in \mathcal{L}$ through

$$\pi(w) = \prod_{t \in w} \pi_t.$$

This gives rise to the notion of weighted generating function $L_\pi(z)$ for a context-free language $\mathcal{L}$, a natural generalization of the ordinary generating function where each word is counted with multiplicity equal to its weight:

$$L_\pi(z) = \sum_{w \in \mathcal{L}} \pi(w)\, z^{|w|} = \sum_{n \ge 0} \mu_{\pi,n}\, z^n,$$

where $\mu_{\pi,n} = \sum_{w \in \mathcal{L}_n} \pi(w)$ is the total weight of words of size $n$. In particular, the number $m_n$ of words of size $n$ can also be expressed as $m_n = |\mathcal{L}_n| = \mu_{1,n}$.

The weighting scheme then defines a weighted distribution on $\mathcal{L}_n$ through

$$\mathbb{P}(w \mid n) = \frac{\pi(w)}{\sum_{w' \in \mathcal{L}_n} \pi(w')} = \frac{\pi(w)}{\mu_{\pi,n}}.$$

Finally let us define the $k$-th moment of a $\pi$-weighted distribution as

$$\alpha_{k,n} = \sum_{i=1}^{|\mathcal{L}_n|} p_i^k = \frac{\sum_{w \in \mathcal{L}_n} \pi(w)^k}{\mu_{\pi,n}^k} = \frac{\mu_{\pi^k,n}}{\mu_{\pi,n}^k}.$$
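As a concrete illustration of these definitions, the following minimal Python sketch enumerates a toy language by brute force and computes the word weights, the total weight, the induced probabilities and the moments. The choice of Motzkin words of length 4 and the weight 2 on dots are assumptions made for the example only, and the function names are hypothetical.

```python
from itertools import product
from math import prod

def motzkin_words(n):
    """All well-parenthesized words of length n over '(', ')' and '.'."""
    words = []
    for letters in product("().", repeat=n):
        depth = 0
        for c in letters:
            depth += (c == "(") - (c == ")")
            if depth < 0:          # a ')' closes nothing: reject
                break
        else:
            if depth == 0:         # every '(' has been closed
                words.append("".join(letters))
    return words

def weight(word, pi):
    """pi(w): product of the weights of the letters of w."""
    return prod(pi[c] for c in word)

def moment(words, pi, k):
    """k-th moment: sum of pi(w)^k divided by (total weight)^k."""
    mu = sum(weight(w, pi) for w in words)
    return sum(weight(w, pi) ** k for w in words) / mu ** k

# Hypothetical weight vector: dots weigh 2, parentheses weigh 1.
pi = {"(": 1.0, ")": 1.0, ".": 2.0}
words = motzkin_words(4)
mu = sum(weight(w, pi) for w in words)          # total weight of the length-4 words
probs = {w: weight(w, pi) / mu for w in words}  # weighted distribution
```

The brute-force enumeration is exponential in n and only meant to make the definitions tangible on small sizes.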



2.2. Asymptotics of coefficients. The (weighted) generating function of an unambiguous context-free language is a positive solution of an algebraic system of equations, therefore its singularities are algebraic. Let us first assume that the dominant singularity $\rho_\pi$ is unique.

Then, for any fixed $\pi$, the coefficients of $L_\pi(z)$ admit an asymptotic equivalent of the form

$$[z^n]\, L_\pi(z) = \mu_{\pi,n} \sim \kappa_\pi \cdot \rho_\pi^{-n} \cdot n^{-k_\pi} \left(1 + O(n^{-k'_\pi})\right), \tag{2.2}$$

for $\rho_\pi \in (0, 1]$, $\kappa_\pi$ some positive real value, and $k_\pi, k'_\pi$ some rational numbers such that $k'_\pi > 0$. The asymptotic equivalent for the number of words $m_n = |\mathcal{L}_n| = [z^n]\, L(z)$ is obtained as a special case of the above, yielding

$$m_n = |\mathcal{L}_n| = [z^n]\, L(z) \sim \kappa \cdot \rho^{-n} \cdot n^{-k} \left(1 + O(n^{-k'})\right) \tag{2.3}$$

with $\rho := \rho_1$, $\kappa := \kappa_1$, $k := k_1$ and $k' := k'_1 > 0$ defined as above.

If the assumption on the uniqueness of the dominant singularity does not hold, then different singularities may be found on the circle of radius $\rho_\pi$. In this case the coefficients of the generating functions do not admit a universal expansion of the form described in Equation (2.2), since the contributions of various singularities may cancel out.



2.3. Weight classes. Let us denote by $W_n$ the vector of all possible and distinct weights within $\mathcal{L}_n$, ordered increasingly ($W_{n,i} < W_{n,i+1}$). In particular, let $W^{\min}_n := W_{n,1}$ (resp. $W^{\max}_n := W_{n,|W_n|}$) be the minimal (resp. maximal) weight of a word within $\mathcal{L}_n$. We denote by $\mathfrak{m}_{n,i} \subset \mathcal{L}_n$ the class of words having weight $W_{n,i}$ and by $m_{n,i} = |\mathfrak{m}_{n,i}|$ its cardinality.



3. Main results 

Let $\mathcal{G}_\pi$ be a weighted context-free grammar generating a language $\mathcal{L}$, $\pi$ its weight vector and $n \in \mathbb{N}$ a length. Recall that the minimal and maximal weights of a word in $\mathcal{L}_n$ are denoted by $W^{\min}_n$ and $W^{\max}_n$ respectively. Let $\rho_\pi$ be the dominant singularity of $L_\pi(z)$, and consider the following conditions:

C1 Diversity: Let $p^{\max}_{\pi,n} := W^{\max}_n / \mu_{\pi,n}$ be the probability of the most probable word within $\mathcal{L}_n$ with respect to a weight function $\pi$, then there exists $\beta > 1$ such that $p^{\max}_{\pi,n} \in O(\beta^{-n})$.

C2 Log-positive weights: For each terminal symbol $t \in \Sigma$, $\pi_t > 1$.

C3 Bounded dependency: For any rational number $k > 1$ and any weight vector $\pi$ such that Condition C2 holds, $\rho_\pi^k < \rho_{\pi^k}$ holds.

Theorem 3.1 (First collision). Under conditions C1, C2 and C3, the expected number of generations $E[B_{n,\pi}]$ before some word of $\mathcal{L}_n$ is drawn twice is such that

$$E[B_{n,\pi}] \sim \sqrt{\frac{\pi}{2\,\alpha_{2,n}}} \in \Omega(\gamma^n) \quad \text{for some } \gamma > 1. \tag{3.1}$$

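Theorem 3.2 (Full collection). The expected number of generations $E[C_{n,\pi}]$ before all the words in $\mathcal{L}_n$ are generated at least once is such that

$$\frac{\mu_{\pi,n}}{W^{\min}_n} \le E[C_{n,\pi}] \le 2 \cdot H_{m_n} \cdot \frac{\mu_{\pi,n}}{W^{\min}_n} \tag{3.2}$$

where $H_m$ is the $m$-th harmonic number, which, for large values of $n$, adopts the equivalent

$$\frac{\kappa_\pi \cdot \rho_\pi^{-n} \cdot n^{-k_\pi}}{W^{\min}_n} \lesssim E[C_{n,\pi}] \lesssim \frac{2 \cdot \log(1/\rho) \cdot \kappa_\pi \cdot \rho_\pi^{-n} \cdot n^{1-k_\pi}}{W^{\min}_n}. \tag{3.3}$$

Moreover in the uniform distribution ($\pi = 1$) the above expression simplifies into

$$E[C_{n,1}] = m_n \cdot H_{m_n} \sim \kappa \cdot \log(1/\rho) \cdot \rho^{-n} \cdot n^{1-k} \left(1 + O(1/n)\right). \tag{3.4}$$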



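Theorem 3.3 (Distinct samples). The expected number $E[N_{n,\pi,k}]$ of distinct words obtained after $k$ generations is such that

$$E[N_{n,\pi,k}] = \sum_{i=1}^{|W_n|} m_{n,i} \cdot \left(1 - \left(1 - \frac{W_{n,i}}{\mu_{\pi,n}}\right)^k\right) = \sum_{i=1}^{|W_n|} m_{n,i} \cdot \left(1 - e^{-k\,W_{n,i}/\mu_{\pi,n}}\right) + O(1). \tag{3.5}$$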

Theorem 3.4 (Coverage). In a weighted distribution, the expected cumulated probability $E[P_{n,\pi,k}] \in [0, 1]$ of the set of distinct words obtained after $k$ generations is given by

$$E[P_{n,\pi,k}] = \sum_{i=1}^{|W_n|} m_{n,i} \cdot \frac{W_{n,i}}{\mu_{\pi,n}} \cdot \left(1 - \left(1 - \frac{W_{n,i}}{\mu_{\pi,n}}\right)^k\right). \tag{3.6}$$

Moreover if Condition C1 is satisfied, then there exists $\beta > 1$ such that, for any $k \in o(\beta^n)$, one has

$$E[P_{n,\pi,k}] = k \cdot \alpha_{2,n} \left(1 + O(\beta^{-n})\right). \tag{3.7}$$

Remark that there are at most $(n + 1)^{|\Sigma|}$ different compositions/classes of weights, and therefore Theorems 3.3 and 3.4 immediately suggest polynomial time algorithms for computing the expected number of distinct words and coverage respectively.
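The exact forms of Theorems 3.3 and 3.4 only depend on the weight classes, which is what makes them cheap to evaluate. A minimal sketch, assuming the classes are supplied as a list of (weight, multiplicity) pairs; the function names and the example classes are hypothetical and chosen for illustration.

```python
def expected_distinct(classes, k):
    """Exact form of Theorem 3.3.

    classes: list of (W_i, m_i) pairs, i.e. a weight and the number of
    words of the given length carrying that weight.
    """
    mu = sum(W * m for W, m in classes)          # total weight
    return sum(m * (1 - (1 - W / mu) ** k) for W, m in classes)

def expected_coverage(classes, k):
    """Exact form of Theorem 3.4: expected cumulated probability of the
    distinct words seen after k generations."""
    mu = sum(W * m for W, m in classes)
    return sum(m * (W / mu) * (1 - (1 - W / mu) ** k) for W, m in classes)

# Illustrative classes: one word of weight 16, six of weight 4, two of weight 1.
classes = [(16.0, 1), (4.0, 6), (1.0, 2)]
```

Both functions run in time linear in the number of weight classes, independently of the number of words in each class.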

3.1. Discussing the loss of generality. Let us discuss the loss of generality induced by the above conditions:

• Condition C1 requires that no polynomial group of words contributes asymptotically to a significant part of the weighted distribution. This is the typical case in weighted languages, as the exponential growth of $\mu_{\pi,n}$ usually arises as a cooperation between the natural combinatorial explosion of the numbers of words and their individual weights. However this condition is restrictive, and discards languages of polynomial growth, or grammars where a (strongly connected) component of polynomial growth dominates asymptotically.

• Condition C2 can be assumed without loss of generality since the weighted distribution is stable through the multiplication of all weights by a positive constant.

• Condition C3: Remember that Condition C1 implies that there exist some constants $C > 0$ and $\beta > 1$ such that $\pi(w) \le C \cdot \mu_{\pi,n} / \beta^n$ for all $w \in \mathcal{L}_n$. It follows that, for all $k > 1$,

$$\mu_{\pi^k,n} = \sum_{w \in \mathcal{L}_n} \pi(w)^k \le \left(\frac{C\,\mu_{\pi,n}}{\beta^n}\right)^{k-1} \sum_{w \in \mathcal{L}_n} \pi(w) = \frac{C^{k-1}\,\mu_{\pi,n}^k}{\beta^{(k-1)n}}.$$

Consequently the exponential growth factor $\rho_{\pi^k}^{-1}$ of $\mu_{\pi^k,n}$ is such that

$$\rho_{\pi^k}^{-1} \le \beta^{-(k-1)} \rho_\pi^{-k} < \rho_\pi^{-k}$$

and Condition C3 is a direct consequence of Condition C1.

4. General theorems 

In the following section we establish general results on non-uniform urn models, which we apply to weighted distributions. Let $u$ be a set of urns, $m = |u|$ its cardinality and, for each $u_i \in u$, let $w_i$ be the weight of $u_i$, and $p_i$ its probability. This defines a probability distribution $p = (p_i)_{i=1}^m$ such that $\sum_{i=1}^m p_i = 1$ and, for $i \in [1, m-1]$, $p_i \le p_{i+1}$.

4.1. Birthday paradox: First collision. 

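Theorem 4.1. Let $\alpha_k := \sum_{i=1}^m p_i^k$, and assume there exists $\tau := \tau(p)$ such that

(A) $p_m \cdot \tau < 1$;

(B) $\sqrt{\alpha_2} \cdot \tau \to +\infty$ when $m \to \infty$;

(C) $\sqrt[3]{\alpha_3} \cdot \tau \to 0$ when $m \to \infty$.

Then the waiting time $E(B)$ of the first birthday can be approximated by

$$E(B) = \sqrt{\frac{\pi}{2\,\alpha_2}}\,(1 + o(1)).$$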



4.1.1. Application to weighted distribution. 

Proposition 4.2. Let $\mathcal{G}_\pi$ be a weighted context-free grammar and $\mathcal{L}$ be its associated language, satisfying Conditions C1, C2 and C3. Then the weighted distribution induced on $\mathcal{L}_n$ satisfies the conditions (A), (B) and (C) of Theorem 4.1 for any $\tau_n := 1/\sqrt[k]{\alpha_{k,n}}$ such that $2 < k < 3$. Consequently the first collision is observed after $E[B \mid n] = \sqrt{\pi / (2\,\alpha_{2,n})}\,(1 + o(1))$ generations.
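To get a feeling for the quality of this first-order estimate, one can compare it with the exact expectation on a small distribution. The sketch below relies on the standard identity that the expected waiting time is the sum over k of the probability that the first k draws are pairwise distinct, which equals k! times the k-th elementary symmetric polynomial of the probabilities; this identity and the function names are choices made for the illustration, not taken from the text.

```python
from math import pi, sqrt

def first_collision_exact(p):
    """Exact E[B]: sum over k >= 0 of P(first k draws are all distinct).

    q[k] stores k! * e_k(p), where e_k is the k-th elementary symmetric
    polynomial of the probabilities processed so far.
    """
    m = len(p)
    q = [1.0] + [0.0] * m
    for pi_ in p:
        for k in range(m, 0, -1):      # descending k keeps old values of q[k-1]
            q[k] += k * pi_ * q[k - 1]
    return sum(q)

def first_collision_estimate(p):
    """First-order estimate sqrt(pi / (2 * alpha_2)), alpha_2 = sum of p_i^2."""
    alpha2 = sum(x * x for x in p)
    return sqrt(pi / (2 * alpha2))

# Classic uniform birthday setting, used here only as a sanity check.
uniform = [1 / 365] * 365
exact = first_collision_exact(uniform)
estimate = first_collision_estimate(uniform)
```

On the uniform case the estimate falls slightly below the exact value, in line with the o(1) relative error of the theorem.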

4.2. Coupon collector: Waiting for the full collection. First let us recall that the uniform case is covered by the following folklore theorem [13].

Theorem 4.3. In the uniform distribution, the waiting time $E[C_1]$ is given by

$$E[C_1] = m \cdot H_m \in \Theta(m \cdot \log(m)). \tag{4.1}$$

Theorem 4.4. In a non-uniform distribution and for large values of $n$, the waiting time $E[C_\pi]$ of the full collection obeys

$$\frac{1}{p_1} \le E[C_\pi] \le 2 \cdot H_m \cdot \frac{1}{p_1} \tag{4.2}$$

where $p_1$ is the smallest probability of an urn.
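The two bounds can be checked numerically on small distributions. The sketch below compares them with the exact expectation, computed through the classical inclusion-exclusion formula for the non-uniform coupon collector; that formula is an addition made for the illustration, it is exponential in the number of urns, and the function names are hypothetical.

```python
from itertools import combinations

def harmonic(m):
    """m-th harmonic number H_m."""
    return sum(1 / i for i in range(1, m + 1))

def collection_bounds(p):
    """Lower and upper bounds of Theorem 4.4: 1/p_1 and 2 * H_m / p_1."""
    p1 = min(p)
    return 1 / p1, 2 * harmonic(len(p)) / p1

def collection_exact(p):
    """Exact E[C] by inclusion-exclusion over non-empty subsets S of urns:
    sum of (-1)^(|S|+1) / (total probability of S). Small m only."""
    total = 0.0
    for r in range(1, len(p) + 1):
        sign = 1 if r % 2 == 1 else -1
        for subset in combinations(p, r):
            total += sign / sum(subset)
    return total

p = [0.1, 0.2, 0.3, 0.4]
lower, upper = collection_bounds(p)
exact = collection_exact(p)
```

In the uniform case the exact value reduces to m times the m-th harmonic number, as in Theorem 4.3.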

Proof. First let us point out that, for any urn $u$, the waiting time of the full collection is greater than the expected time when a first ball reaches $u$. Since the least probable urn has probability $p_1$, and is therefore first reached after $1/p_1$ generations on average, the lower bound on $E[C_\pi]$ immediately follows.

From a recent contribution by Berenbrink and Sauerwald [2], we know that the waiting time $E[C_\pi]$ for the full collection of $m$ items drawn with respective probabilities $p_1 \le p_2 \le \ldots \le p_m$ can be approximated within an $O(\log \log m)$ factor by an estimate

$$U_m = \sum_{i=1}^m \frac{1}{i\,p_i}. \tag{4.3}$$

More precisely it is shown in [2] that

$$\frac{U_m}{3e \cdot \log \log m} \le E[C_\pi] \le 2\,U_m. \tag{4.4}$$

Figure 1. Plots of $p_1 \cdot U_m$ for weighted Motzkin words exhibit a linear growth in $n$, suggesting that the upper bound is reached.

In our urn model, Equation (4.3) specializes into

$$U_m = \frac{1}{p_1} \sum_{i=1}^m \frac{1}{i\,\lambda_i}$$

where $\lambda_i := p_i / p_1$. Since $p_1$ is the probability of the least probable urn, one has $\lambda_i \ge 1$, $\forall i \in [1, m]$, and therefore the following upper bound holds

$$U_m \le \frac{1}{p_1} \sum_{i=1}^m \frac{1}{i} = \frac{H_m}{p_1}$$

in which one recognizes the upper bound of Equation (4.2). □

Experiments suggest that the upper bound is in fact reached. For instance, Figure 1 shows the value $p_1 \cdot U_m$ for weighted Motzkin paths, where a weight $W > 1$ is associated to horizontal steps, while up and down steps remain unweighted. In such a case the growth of $p_1 \cdot U_m$ appears to be linear, with different slopes depending on the parity of $n$. This phenomenon is due to the fact that the minimal number of horizontal steps in a Motzkin word of length $n$ is 0 (resp. 1) for even (resp. odd) lengths, leading to minimal weights of 1 for even lengths and $W$ for odd ones.

4.3. Occupancy analysis. Figuring out the average state after $k$ generations turns out to be easier than the inverse problem, namely finding the expected number $k$ of generations before a given state is observed. We refer to a survey [16] by one of the authors for examples of urn models in the context of the analysis of algorithms. Here we establish a general formula for the cumulated weight in a weighted urn model through a generating function analysis.

Theorem 4.5. The total weight $E[W_k]$ of occupied urns after throwing $k$ balls is given by

$$E[W_k] = \sum_{i=1}^m w_i \cdot \left(1 - (1 - p_i)^k\right). \tag{4.5}$$

Proof. Consider the bivariate generating function

$$\Psi(x, y) = \sum_{j \ge 0} \sum_{k \ge 0} a_{j,k}\, x^j\, \frac{y^k}{k!}$$

where $a_{j,k}$ is the probability of reaching a set of urns having cumulated weight equal to $j$ upon throwing $k$ balls. Remark that such random allocations can be reinterpreted as sequences of $m$ urns, each urn $u_i$ containing either a non-empty set of balls (associated with an $x^{w_i}(e^{p_i y} - 1)$ contribution) or no ball (contribution 1). Consequently the generating function $\Psi(x, y)$ can be reformulated as

$$\Psi(x, y) = \prod_{i=1}^m \left(1 + x^{w_i}\left(e^{p_i y} - 1\right)\right).$$

The generating function for the expectation of weight is then classically obtained through a partial derivative on $x$:

$$\begin{aligned}
E[W_k] &= k!\,[y^k]\, \frac{\partial \Psi}{\partial x}(1, y) = k!\,[y^k] \sum_{i=1}^m w_i \left(e^{p_i y} - 1\right) \prod_{j \ne i} e^{p_j y} \\
&= k!\,[y^k] \sum_{i=1}^m w_i \left(e^{p_i y} - 1\right) e^{y(1 - p_i)} = k!\,[y^k]\; e^{y} \sum_{i=1}^m w_i \cdot \left(1 - e^{-p_i y}\right) \\
&= \sum_{i=1}^m w_i \cdot \left(1 - (1 - p_i)^k\right).
\end{aligned}$$

□
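The closed form of Theorem 4.5 can be cross-checked on tiny instances by summing over every possible sequence of k throws. A minimal sketch, with hypothetical function names and an arbitrary three-urn example:

```python
from itertools import product
from math import prod

def occupied_weight_formula(w, p, k):
    """Closed form (4.5): sum of w_i * (1 - (1 - p_i)^k)."""
    return sum(wi * (1 - (1 - pi) ** k) for wi, pi in zip(w, p))

def occupied_weight_bruteforce(w, p, k):
    """Expected weight of occupied urns, by enumerating all m^k sequences
    of throws together with their probabilities."""
    m = len(w)
    total = 0.0
    for throws in product(range(m), repeat=k):
        probability = prod(p[i] for i in throws)
        total += probability * sum(w[i] for i in set(throws))
    return total

w = [1.0, 2.0, 5.0]     # illustrative urn weights
p = [0.5, 0.3, 0.2]     # illustrative urn probabilities
```

The agreement is exact up to floating-point error, since the formula is just linearity of expectation applied urn by urn.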



Remark that, upon setting $w_i = 1$, Equation (4.5) simplifies into the expected number $E[N_k]$ of urns reached by at least one ball (cf Hwang and Janson [17]), such that

$$E[N_k] = \sum_{i=1}^m \left(1 - (1 - p_i)^k\right) = \sum_{i=1}^m \left(1 - e^{-p_i k}\right) + O(1).$$
i=i i=i 

4.3.1. Asymptotic estimates for the coverage. Let us start from the formula

$$E[W_k] = \sum_{i=1}^m w_i \cdot \left(1 - (1 - p_i)^k\right) = \sum_{i=1}^m w_i \cdot \left(1 - e^{k \cdot \log(1 - p_i)}\right).$$

Since $p_i < 1$ for all $i \in [1, m]$, one can use the approximation $\log(1 - p_i) = -p_i + O(p_i^2)$ for large values of $m$, which can be injected into $E[W_k]$ to obtain

$$E[W_k] = \sum_{i=1}^m w_i \left(1 - e^{-k\,(p_i + O(p_i^2))}\right).$$

If $k \cdot p_m \in o(1)$, then $k \cdot p_i \le k \cdot p_m \in o(1)$ for all $i \in [1, m]$, and therefore $e^{-k\,(p_i + O(p_i^2))} = 1 - k p_i + O(k p_i^2)$, which gives

$$E[W_k] = \sum_{i=1}^m w_i \left(k p_i + O(k p_i^2)\right) = k \sum_{i=1}^m w_i\, p_i + O\!\left(k \sum_{i=1}^m w_i \cdot p_i^2\right). \tag{4.6}$$

In weighted languages that satisfy Condition C1 there exists $\beta > 1$ such that $p_i \in O(\beta^{-n})$, for all $i \in [1, m]$. Consequently, for any $k \in o(\beta^n)$, one has

$$E[W_k] = k \sum_{i=1}^m w_i\, p_i \left(1 + O(\beta^{-n})\right) = k \cdot \mu_{\pi,n} \cdot \alpha_{2,n} \left(1 + O(\beta^{-n})\right).$$
i=l 

5. Application to the statistical sampling of RNA 

5.1. Motivation. Random generation has recently found a novel application in the in silico prediction of RNA folding. Namely a state-of-the-art method [5] for predicting the functional folding of a given RNA sequence uses a non-uniform random generation scheme [8]. This method aims at predicting the functional, or native, secondary structure of an RNA, a coarse-grain representation of the three-dimensional conformation. Based on the observation that the native structure is not necessarily that of lowest free-energy, Ding et al used a model initially proposed by McCaskill [18], and hypothesized a Boltzmann distribution based on the free-energy over the set of possible conformations. Their method generates a representative set of 1000 secondary structures using a statistical sampling algorithm [9]. These structures are then clustered and a consensus structure is extracted. Considering this consensus led to a better sensitivity/specificity tradeoff than previous approaches based on free-energy minimization [25].

Figure 2. Secondary structure (Left) of a transfer RNA (tRNA) and its equivalent representation as a Motzkin walk (Bottom-right). Top-right: Typical picture of the Boltzmann ensemble, i.e. a set of secondary structures compatible with the RNA sequence, colored according to their respective Boltzmann factor $e^{-E_s/RT}$.

However, given the variability in length and sequence composition of real RNAs, the 1000 struc- 
tures criterion seems somewhat arbitrary and may lead to irreproducible observations in the context 
of highly variable observables. On the other hand, the sampled sets of structures might feature a 
large level of redundancy. Our theorems provide useful tools for a quantitative characterization of 
such situations. 

5.2. Statistical sampling of RNA secondary structures. An RNA sequence can be encoded by a sequence of bases A, C, G and U where local compatibility rules (A-U, C-G, and G-U) allow for a folding, i.e. a formation of chemical bonds between pairs of bases. The RNA secondary structure constitutes a restriction of all possible base-pairings, where each base is involved in at most one base-pair, with the additional constraint that the induced matching does not feature crossing interactions. A simplified energy model of Nussinov [19] assigns free-energy contributions $E_b$ between $-3.0$ and $-1.0$ kcal·mol$^{-1}$ to each base-pair $b$, depending on the number of hydrogen bonds involved in the interaction. The total free-energy $E_s = \sum_{b \in s} E_b$ of a secondary structure $s$ is then inherited additively, and each secondary structure $s$ is drawn with probability proportional to its Boltzmann factor $e^{-E_s/RT}$, where $R$ is the perfect gas constant and $T$ the temperature in Kelvin.

5.3. Statistical sampling as a weighted generation. Let us first recall that Motzkin words are well-parenthesized words featuring any number of dot characters •. Let us define a peak as an occurrence of a motif ( ), and a $k$-plateau as an occurrence of a motif ( •$^k$ ), $k > 0$. Let $\theta \in \mathbb{N}$ be a parameter, then one defines secondary structures as peakless Motzkin words, or more generally as Motzkin words that are free of $t$-plateaux, for any $t < \theta$. The correspondence between coarse-grained conformations and Motzkin words is illustrated in Figure 2. Each pair of matching parentheses represents a base-pair, and the $\theta$ constant models steric constraints and is typically set to 1 in combinatorial studies [23] and to 3 in most RNA folding software. Through an adaptation of Viennot et al [22], secondary structures can be generated from a non-terminal $S$ using rules

$$S \to (\,S_{\ge\theta}\,)\,S \;\mid\; \bullet\,S \;\mid\; \varepsilon \qquad\qquad S_{\ge\theta} \to (\,S_{\ge\theta}\,)\,S \;\mid\; \bullet\,S_{\ge\theta} \;\mid\; \bullet^{\theta}.$$
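The grammar translates directly into a dynamic programming scheme for the total weight of secondary structures of each length. The sketch below is a hedged illustration: it assumes a weight w on each base-pair and weight 1 on each dot, reads the two non-terminals as S (any structure) and T (the non-terminal with subscript "at least theta"), and uses a hypothetical function name.

```python
def weighted_counts(n_max, theta, w):
    """Total weight of secondary structures of each length 0..n_max.

    S[n]: total weight of words of length n derived from S.
    T[n]: same for the non-terminal requiring plateaux of length >= theta.
    Rules used:  S -> ( T ) S | . S | empty
                 T -> ( T ) S | . T | theta dots
    """
    S = [0] * (n_max + 1)
    T = [0] * (n_max + 1)
    for n in range(n_max + 1):
        # ( T ) S : two letters for the base-pair, j letters inside, rest outside
        paired = w * sum(T[j] * S[n - 2 - j] for j in range(n - 1))
        S[n] = (1 if n == 0 else S[n - 1]) + paired
        T[n] = (1 if n == theta else 0) + (T[n - 1] if n > 0 else 0) + paired
    return S
```

With w = 1 the function simply counts structures; for theta = 1 the first values follow the classical sequence of peakless Motzkin words.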

5.4. Expected times for first collision and full collection. Assuming a standard homopolymer model, in which any pair of bases can bind, statistical sampling is equivalent to a weighted random generation, taking $w := e^{-E/RT}$ as the weight of any base-pair $b$ (e.g. any occurrence of an opening parenthesis). The resulting weighted generating function is then given by

$$S_{w,\theta}(z) = \frac{1 - 2z + (w + 1)z^2 - w z^{\theta+2} - \sqrt{\Delta_{w,\theta}}}{2w\,(1 - z)\,z^2}$$

$$\begin{aligned}
\Delta_{w,\theta} := {}& 1 - 4z + (6 - 2w)z^2 + 4(w - 1)z^3 + (w - 1)^2 z^4 \\
& - 2w z^{\theta+2} + 4w z^{\theta+3} - 2w(1 + w) z^{\theta+4} + w^2 z^{2\theta+4}.
\end{aligned}$$

Figure 3. Expected coverage (Top) and proportion of distinct words (Bottom) within sampled sets of words of various lengths, considering different values for $\theta$ and $E$, the free-energy contribution of a base-pair.

Using our formulae, one can get estimates for the waiting times $E[B_{n,\theta,E}]$ and $E[C_{n,\theta,E}]$ for the first collision and full collection respectively, and observes the following behaviors



0.64-4.33" , ^ 1.24-4.33" 0.065-12.65" ^ , ^ 0.11-12.65" 



< £[C„,i,_i] < ^ ^ < E[C, 



i— i— ~ — L~n,3,-3J /— 



< 



First one sees that the nature of these growths is unaffected by a change of weights and/or values of $\theta$. This is not really surprising, since the grammar is strongly connected and therefore always gives rise to generating functions whose singularities are of square-root type [10]. However the exponential growth factor is strongly affected by these variations, with practical consequences. For instance, considering tRNAs ($n = 80$) and using our first order approximation gives a time of first collision of $\approx 4.7 \cdot 10^{13}$ samples in the ($\theta = 1$, $E = -1$) model, while only $\approx 93.55$ samples are required in the ($\theta = 3$, $E = -3$) model for the first collision to occur.

5.5. Collisions and coverage. Finally let us address the coverage and number of distinct samples obtained by a random generation scenario. Remark that RNA secondary structures of length $n$ with $k$ plateaux are in bijection with Motzkin words of length $n - k\theta$ with $k$ peaks/plateaux, where the bijection simply consists in removing the first $\theta$ horizontal steps of each plateau in every secondary structure. Let us further recall that Dyck words with $k$ peaks and $2i$ letters are counted by the Narayana numbers $N(i, k)$, and that Motzkin words are obtained by inserting some dots within a Dyck word. It follows that the number $s_{n,k,i,\theta}$ of secondary structures of length $n$ featuring $i$ plateaux and $k \ge i$ base-pairs is such that

$$s_{n,k,i,\theta} = N(k, i) \binom{n - i\theta}{2k} = \frac{1}{k} \binom{k}{i} \binom{k}{i-1} \binom{n - i\theta}{2k}.$$

Using the above formula, one can compute exactly in polynomial time the expected coverage from Theorem 3.4 and the proportion of distinct samples from Theorem 3.3, and one obtains the results summarized in Figure 3. Interestingly, Figure 3 shows that the inevitable decay of the coverage can be delayed by free-energy contributions of large absolute values. For instance a sampled set of 1000 structures still achieves a 50% coverage for RNAs of length $\le 30$ for a free-energy contribution in the ($\theta = 3$, $E = -3$) model while yielding a negligible coverage in the ($\theta = 1$, $E = -1$) model. This suggests that, for highly stable RNAs (having low free-energy) of modest size, the 1000 structure criterion might be sufficient. Also a symmetry of the coverage and proportion can be observed, although the amplitude of the oscillations for $\theta = 3$ seems to have less of an impact on the proportion of distinct words than on their coverage.
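The Narayana-based count above can be turned into weight classes in a few lines. The following sketch groups structures by their number k of base-pairs, each class having weight w^k; the function names are hypothetical, and the k = 0 case (the single all-dots structure) is handled separately since the 1/k factor is undefined there.

```python
from math import comb

def structure_count(n, k, i, theta):
    """Secondary structures of length n with k base-pairs and i plateaux."""
    if k == 0:
        return 1 if i == 0 else 0       # only the all-dots structure
    if not 1 <= i <= k:
        return 0
    free = n - i * theta                # length once theta dots are removed per plateau
    if free < 2 * k:
        return 0
    narayana = comb(k, i) * comb(k, i - 1) // k
    return narayana * comb(free, 2 * k)

def total_count(n, theta):
    """Number of secondary structures of length n."""
    return sum(structure_count(n, k, i, theta)
               for k in range(n // 2 + 1) for i in range(k + 1))

def weight_classes(n, theta, w):
    """(weight, multiplicity) pairs, one per number k of base-pairs."""
    classes = []
    for k in range(n // 2 + 1):
        count = sum(structure_count(n, k, i, theta) for i in range(k + 1))
        if count:
            classes.append((w ** k, count))
    return classes
```

The resulting list of (weight, multiplicity) pairs is exactly the input needed to evaluate the sums of Theorems 3.3 and 3.4, in time polynomial in n.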

6. Conclusion and perspectives 

In this article, we investigated the redundancy of random sets of words of context-free languages drawn with respect to a weighted distribution. Using a random allocation model we derived exact and/or asymptotic equivalent forms for: the expected numbers of generations prior to the first collision and full collection, the average proportion of distinct words within a sampled set of $k$ words and its cumulated probability. Interestingly, the second moment of the probability distribution appears in the asymptotic behaviors of both the first collision and the expected coverage. We applied these theorems to analyze the output of a statistical sampling algorithm used to predict the functional folding of RNA molecules. We showed that, although the time of first collision is exponential in the length of the RNA, its exponential factor strongly depends on the free-energy contribution of base-pairs, and may still allow for frequent collisions for RNAs of small, yet relevant, lengths.

Future directions for this work first include a better characterization of the full collection waiting time. Namely we showed that, unsurprisingly, the waiting time is dominated by the overall (exponential) weight, but obtained lower and upper bounds that are still separated by a $\Theta(n)$ factor. A possible direction for a tighter bound resides in algebraic manipulations of harmonic numbers coupled with additional assumptions on the distribution of weights (i.e. distribution of symbols), for which local limit theorems are known to hold under certain hypotheses. Also we may refine our analysis of RNA statistical sampling, using more sophisticated, yet still context-free, grammars in order to accommodate more realistic models for the free-energy.
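Appendix

Proof. (Theorem 4.1) From [13], the waiting time for the first birthday can be expressed as

$$E(B) = \int_0^{+\infty} \lambda(t)\, e^{-t}\, dt, \quad \text{with} \quad \lambda(t) = \prod_{i=1}^m (1 + p_i t).$$

Let us approximate this integral under the conditions of the theorem. We cut the integral at $\tau$, and independently consider the part from $0$ to $\tau$, which we expect to be dominating, and the part from $\tau$ to $+\infty$, which should give rise to a negligible contribution.

Let us first approximate the integral $\int_0^\tau \lambda(t)\, e^{-t}\, dt$. Consider

$$\psi(t) = \log \lambda(t) - t = \sum_{i=1}^m \log(1 + p_i t) - t.$$

Then $E(B) = \int_0^{+\infty} e^{\psi(t)}\, dt$. Let us consider a positive real value $t \le \tau$ such that any value $p_i t$ is uniformly bounded by some $A < 1$, then $\log(1 + p_i t) = p_i t - p_i^2 t^2/2 + O(p_i^3 t^3)$, where the bound implied in the $O(\cdot)$ term is uniform in $p_i t$. Summing over the whole distribution gives

$$\sum_{i=1}^m \log(1 + p_i t) = t - \frac{\alpha_2 t^2}{2} + O(\alpha_3 t^3),$$

then $\psi(t) = -\alpha_2 t^2/2 + O(\alpha_3 t^3) = -\alpha_2 t^2/2 + O(\alpha_3 \tau^3)$. Plugging this into the integral gives

$$\int_0^\tau \lambda(t)\, e^{-t}\, dt = \left(1 + O(\alpha_3 \tau^3)\right) \int_0^\tau e^{-\alpha_2 t^2/2}\, dt.$$

This last integral is computed by a change of variable $u = t\sqrt{\alpha_2}$. Approximating with a Gaussian integral $\int_0^{+\infty} e^{-u^2/2}\, du = \sqrt{\pi/2}$ finally gives

$$\int_0^\tau \lambda(t)\, e^{-t}\, dt = \sqrt{\frac{\pi}{2\,\alpha_2}} \left(1 + O\!\left(e^{-\alpha_2 \tau^2/2}\right) + O\!\left(\alpha_3 \tau^3\right)\right).$$

Of course, the validity of this expansion requires that

• The error terms in the above equation are $o(1)$: This follows from our assumptions on $\tau$, recalling that $\alpha_3 \tau^3 \to 0$ (Condition (C)) and $\alpha_2 \tau^2 \to \infty$ (Condition (B)).

• Each of the terms $p_i t$ is uniformly bounded by some $A < 1$: Since $p_m$ is the greatest probability, it suffices that $p_m \tau \le A < 1$ (Condition (A)).

Let us bound the value of the remainder $\int_\tau^{+\infty} \lambda(t)\, e^{-t}\, dt$. We factor out the term $\lambda(\tau)\, e^{-\tau}$, which we expect to be dominant. The remaining term is

$$\int_\tau^{+\infty} \frac{\lambda(t)}{\lambda(\tau)}\, e^{-(t - \tau)}\, dt = \int_0^{+\infty} \prod_{i=1}^m \left(1 + \frac{p_i u}{1 + p_i \tau}\right) e^{-u}\, du.$$

Now, for any positive $x$, $\log(1 + x) < x$, which gives a bound

$$\prod_{i=1}^m \left(1 + \frac{p_i u}{1 + p_i \tau}\right) \le \exp\left(u \sum_{i=1}^m \frac{p_i}{1 + p_i \tau}\right) = \exp\left(u\,(1 - B(\tau))\right)$$

with $B(\tau) = \sum_{i=1}^m \left(-\frac{p_i}{1 + p_i \tau} + p_i\right) = \sum_{i=1}^m \frac{p_i^2 \tau}{1 + p_i \tau}$. It follows that

$$\int_0^{+\infty} \prod_{i=1}^m \left(1 + \frac{p_i u}{1 + p_i \tau}\right) e^{-u}\, du \le \int_0^{+\infty} e^{-u B(\tau)}\, du = \frac{1}{B(\tau)}$$

and finally

$$\int_\tau^{+\infty} \lambda(t)\, e^{-t}\, dt \le \frac{\lambda(\tau)\, e^{-\tau}}{B(\tau)}.$$

Let us consider the order of $B(\tau)$. We easily check that, for each $i$,

$$0 < 1 - p_i \tau < \frac{1}{1 + p_i \tau} < 1,$$

thus

$$p_i^2 \tau - p_i^3 \tau^2 < \frac{p_i^2 \tau}{1 + p_i \tau} < p_i^2 \tau,$$

which gives bounds on $B(\tau)$ as

$$\alpha_2 \tau - \alpha_3 \tau^2 < B(\tau) < \alpha_2 \tau.$$

Finally, we can bound the error term. In order to conclude, we need to show that

$$\lambda(\tau)\, e^{-\tau} / B(\tau) = o\!\left(1/\sqrt{\alpha_2}\right).$$

First rewrite the last condition as $e^{-\alpha_2 \tau^2/2 + O(\alpha_3 \tau^3)} = o\!\left(B(\tau)/\sqrt{\alpha_2}\right)$, taking advantage of $\lambda(\tau) = e^{\tau - \alpha_2 \tau^2/2 + O(\alpha_3 \tau^3)}$. Assume that we have chosen $\tau$ such that $\alpha_2 \tau^2 \to +\infty$ and $\alpha_3 \tau^3 \to 0$; then $B(\tau)$ has exact order $\alpha_2 \tau$ and the condition collapses to $e^{-\alpha_2 \tau^2/2 + O(\alpha_3 \tau^3)} = o\!\left(\tau \sqrt{\alpha_2}\right)$, which is trivial. □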

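Proof. (Proposition 4.2) Let us first recall that the exponential order [14] of a sequence $f_n$ is a simple exponential function $K^n$ such that

$$\limsup_{n \to \infty} |f_n|^{1/n} = K.$$

Following the notations of Flajolet and Sedgewick's book [14], we make use of the bowtie notation, and write $f_n \bowtie K^n$ if $f_n$ has exponential order $K^n$. It is a classic result [14, Theorem IV.7] that the dominant singularity $\rho$ of a generating function determines the exponential order of its coefficients $c_n$, namely through $c_n \bowtie \rho^{-n}$.

Since $\rho_{\pi_0}^k < \rho_{\pi_0^k}$ holds for any $k > 1$ and $\pi_0 > 1$ (Condition C3), it follows that

$$s_{n,k} := \sqrt[k]{\mu_{\pi_0^k,n}} \bowtie \left(\frac{1}{\sqrt[k]{\rho_{\pi_0^k}}}\right)^n \quad \text{and} \quad \sqrt[k]{\rho_{\pi_0^k}} > \rho_{\pi_0} \tag{6.1}$$

for $\pi_0$ any vector of weights strictly larger than 1, and $\rho_{\pi_0}$ the dominant singularity of $L_{\pi_0}(z)$. This result generalizes to any pair $(a, b) \in \mathbb{R}^2$ of numbers such that $1 \le a < b$. Indeed, upon taking $\pi_0 = \pi^a$ and $k = b/a$ in the above equation, one has $s_{n,a} \bowtie \left(1/\sqrt[a]{\rho_{\pi^a}}\right)^n$, $s_{n,b} \bowtie \left(1/\sqrt[b]{\rho_{\pi^b}}\right)^n$, and it follows from Condition C3 that

$$\sqrt[a]{\rho_{\pi^a}} < \sqrt[b]{\rho_{\pi^b}}. \tag{6.2}$$

Consequently, for any $1 \le a < b$, $s_{n,a}$ grows exponentially faster than $s_{n,b}$, and one can use such a hierarchy to squeeze $\tau_n^{-1}$ between $\sqrt{\alpha_{2,n}}$ and $\sqrt[3]{\alpha_{3,n}}$. Namely let us consider

$$\tau_n := \frac{1}{\sqrt[k]{\alpha_{k,n}}} = \frac{\mu_{\pi,n}}{\sqrt[k]{\mu_{\pi^k,n}}}$$

for some $k \in \mathbb{Q}$ such that $2 < k < 3$. Then we have

$$\sqrt{\alpha_{2,n}} \cdot \tau_n = \frac{\sqrt{\mu_{\pi^2,n}}}{\mu_{\pi,n}} \cdot \frac{\mu_{\pi,n}}{\sqrt[k]{\mu_{\pi^k,n}}} = \frac{s_{n,2}}{s_{n,k}} \bowtie \left(\frac{\sqrt[k]{\rho_{\pi^k}}}{\sqrt{\rho_{\pi^2}}}\right)^n$$

and it follows from $\sqrt{\rho_{\pi^2}} < \sqrt[k]{\rho_{\pi^k}}$ that

$$\lim_{n \to \infty} \sqrt{\alpha_{2,n}} \cdot \tau_n = +\infty$$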



14 DANIELE GARDY AND YANN PONTY 



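and consequently Condition (B) is satisfied by our candidate $\tau_n$.

Reciprocally for Condition (C) one has

$$\sqrt[3]{\alpha_{3,n}} \cdot \tau_n = \frac{s_{n,3}}{s_{n,k}} \bowtie \left(\frac{\sqrt[k]{\rho_{\pi^k}}}{\sqrt[3]{\rho_{\pi^3}}}\right)^n$$

and, since $\sqrt[3]{\rho_{\pi^3}} > \sqrt[k]{\rho_{\pi^k}}$, then

$$\lim_{n \to \infty} \sqrt[3]{\alpha_{3,n}} \cdot \tau_n = 0.$$

Condition (A) is also satisfied by $\tau_n$ upon observing that

$$p_m \cdot \tau_n = \frac{W^{\max}_n}{\mu_{\pi,n}} \cdot \frac{\mu_{\pi,n}}{\sqrt[k]{\mu_{\pi^k,n}}} = \frac{W^{\max}_n}{\sqrt[k]{\mu_{\pi^k,n}}}$$

where $W^{\max}_n$ is the weight of the heaviest (i.e. most probable) word $w \in \mathcal{L}_n$. This word is also contributing to $\mu_{\pi^k,n} = \sum_{w \in \mathcal{L}_n} \pi(w)^k$ and therefore

$$\frac{W^{\max}_n}{\sqrt[k]{\mu_{\pi^k,n}}} < 1,$$

which suffices to prove that Condition (A) is satisfied. Consequently, the preconditions of Theorem 4.1 are satisfied by any weighted distribution. □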